Revana: a comprehensive tool for regulatory variant analysis and visualization of cancer genomes

Abstract Motivation As non-coding driver mutations move more into the focus of cancer research, a comprehensive and easy-to-use software solution for regulatory variant analysis and data visualization is highly relevant. The interpretation of regulatory variants in large tumor genome cohorts requires specialized analysis and visualization of multiple layers of data, including for example breakpoints of structural variants, enhancer elements and additional available gene locus annotation, in the context of changes in gene expression. Results We introduce a user-friendly tool, Revana (REgulatory Variant ANAlysis), that can aggregate and visually represent regulatory variants from cancer genomes in a gene-centric manner. It requires whole-genome and RNA sequencing data of a cohort of tumor samples and creates interactive HTML reports summarizing the most important regulatory events. Availability and implementation Revana is implemented in R and JavaScript. It is available for download as an R package under <https://github.com/KiTZ-Heidelberg/revana>. Sample results can be viewed under <https://github.com/KiTZ-Heidelberg/revana-demo-report> and a short walkthrough is available under <https://github.com/KiTZ-Heidelberg/revana-demo-data>. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
The sensitive and specific detection of driver mutations is an important task in cancer genomics. While variants in the protein-coding space are nowadays mostly relatively easy to detect, the reliable detection of noncoding driver variants is more challenging yet equally relevant (Rheinbay et al., 2020). Non-coding drivers involve diverse mechanisms and usually function by altering the expression of certain, often spatially proximal genes. They do not alter the amino-acid structure of the translated proteins and are thus called regulatory variants. Prominent examples are recurrent single-nucleotide variants (SNVs) in the promoter of TERT (Horn et al., 2013;Huang et al., 2013), that upregulate the gene and thus maintain telomeres across many cancer entities (Remke, 2013;Vinagre et al., 2013).
Previously developed methods to discover regulatory variants from whole-genome (WGS) and RNA sequencing (RNA-seq) data include cis-X (Liu et al., 2020). It relies on allele specific as well as outlier high expression (OHE) to identify genes activated in cis, and subsequently associates them with potential driver mutation candidates. In contrast, CESAM (Weischenfeldt et al., 2017) uses linear regression of gene expression on somatic copy number alterations (CNAs), while Rheinbay et al. (2020) introduced the concepts of recurrent SV breakpoints and SV juxtaposition.
We provide an R package for the visualization and analysis of regulatory variants in cancer genome cohorts. Besides combining several features of the existing methods, we added consideration of genespecific regulatory regions, sample aggregation and comprehensive result evaluation features, that no existing solution provides yet to the best of our knowledge. Subsequent analysis of a well-described, published cohort of medulloblastoma genomes demonstrates that Revana (REgulatory Variant ANAlysis) can effectively detect and illustrate known regulatory variants.

Data input and preparation
Revana requires processed data from deep WGS and RNA-seq for a cohort of tumor samples as input. Tab-separated value files providing SNP markers, copy numbers, structural variants, somatic SNVs and InDels and gene expression feature counts are used as input (hg19 or hg38) and need to be created prior to the application of Revana with any of the many available tools (see Supplementary Material).

Identification of cis-activated genes
Revana adopts the approach used by Liu et al., combining allelespecific expression (ASE) and OHE to identify cis-activated genes that are potentially subject to variant-based upregulation. ASE statistics calculated from heterozygous SNP marker allele frequencies are compared to a balanced expression model established by Liu et al. OHE is determined using a leave-one-out test against a prefiltered set of other tumor samples of the cohort and eliminates the need for a precalculated gene expression reference matrix.

Associating genes with non-coding, regulatory variants
To identify potential regulatory somatic variants responsible for the upregulation of cis-activated genes, Revana matches genes and variants using three alternative approaches. As also utilized in cis-X, it first considers variants of the same topologically associating domain (TAD) in which the gene is located. Cis-X prioritized SVs and CNAs and only considered somatic SNVs/InDels if no other variant type could be detected. Yet, since point mutations are frequent events in cancer genomes, their presence alone cannot determine genuine regulatory variant candidates. We found, that a more rigorous selection of SNVs by transcription factor binding site analysis with FIMO (Grant et al., 2011) as used in cis-X, was not able to improve the predictive power with respect to cis-activation. Thus, we provide it only as a descriptive feature.
Conversely, Revana associates genes irrespective of their cisactivation status with all types of variants from the same TAD. Thus, the strength of association for each mutation type with gene cis-activation can be estimated and compared (e.g. odds ratio, OR SNV/InDel ¼ X, ORSV ¼ Y).
A second, complementary approach exploits gene-specific regulatory regions, so-called GeneHancers (Fishilevich et al., 2017). Promoter and enhancer elements of each gene under investigation allow for the matching of mutations beyond the TAD as well as a more selective choice of point mutations within the same TAD. Optionally, ChiP-seq data, ideally tissue-matched, can provide regulatory regions specific to the samples under investigation and is then used for gene-tovariant association, result annotation and visualization.

Automatic report generation and data visualization
Subsequently, Revana creates a comprehensive HTML report that helps evaluate, prioritize and visualize candidate events and allows to compare results of different tumor subgroups.
JavaScript-powered interactive illustrations show recurrently cisactivated genes across the cohort, highlight samples with different types of potential regulatory variants and provide multiple sorting and filtering options. This allows to prioritize between the discovered gene candidates and quickly elucidate genes repeatedly affected by comparable regulatory mechanisms. Moreover, recurrent SV juxtapositions can provide further support for systematically involved genetic mechanisms.
For each discovered gene candidate, a separate document shows the recurrence of mutations, the expression and allelic imbalance of the gene across the investigated samples and tumor subgroups as well as more explicit statistics. A gene locus plot illustrates all present somatic variants and thus helps interpreting the detected somatic genomic events. Finally, the integrative genome viewer (IGV) plugin (Robinson et al., 2022), facilitates interactive inspection of the gene TAD, other distant breakpoints and any available alignment directly within the browser.

Results
To demonstrate Revana's capabilities, we applied it to our cohort of 128 medulloblastomas from the International Cancer Genome Consortium (ICGC PedBrain) and identified a total of 3952 cisactivated candidate genes across the cohort, of which 1502 carry potential regulatory variants. Revana's filtering and sorting features quickly revealed the known top candidates (Fig. 1A). These include the two well-studied enhancer hijacking loci PRDM6 (Fig. 1B and C) (6/7 cases recognized; one case lacked sufficient SNP markers for ASE detection) (Northcott et al., 2017) and GFI1B (4/4 cases) (Northcott et al., 2014) as well as characteristic PVT1 fusions (Northcott et al., 2012). Gene locus illustrations summarize these genomic events (Fig. 1C). Our results conclusively demonstrate that Revana can identify known and novel non-coding regulatory variants in cohorts of cancer genomes.
Financial Support: none declared.
Conflict of Interest: none declared.